Public companies, both domestic and foreign, are required to submit filings to the U.S. Securities and Exchange Commission (SEC) that contain company financial and operational information in various forms and periods, such as quarterly reports (10-Q) and annual reports (10-K). The filing process is managed through EDGAR, the Electronic Data Gathering, Analysis, and Retrieval system. This benefits investors, companies, and the U.S. economy by enhancing the transparency, fairness, and efficiency of securities markets.
Furthermore, the annual report (10-K form) gives companies a great opportunity to present their business information, risk analysis, financial data, and management’s discussion and analysis (MD&A), which is the only part in which companies may describe business performance without conforming to the Generally Accepted Accounting Principles (GAAP). This possibly results in a positive bias, in which companies deliberately use positive words in the discussion to avoid a negative effect on the stock price. In fact, [@Taylor] found that the sentiment diminishes after the non-GAAP texts are removed. Consequently, this raises the questions of whether analysis of the MD&A part can provide meaningful insights to investors, and what the effect of the text sentiment per se is.
Therefore, the purpose of this report is to extract insights from the MD&A part of 10-K filings of S&P 500 companies in the Information Technology sector (GICS) and analyze them to predict stock price movements using a content-based analysis approach, i.e., text mining. This approach was first developed by [@Loughran], who conducted a textual analysis of 10-Ks and their effect on stock returns, particularly in the financial context, and eventually developed a dictionary for textual analysis. Furthermore, [@Chouliaras] concluded that 10-K negative polarity affected the stock price after filing on a monthly basis. In particular, [@Ahern] showed that firms that manipulate media coverage obtain higher stock prices, which signifies the impact of textual sentiment on the stock price.
The rest of this report is organized as follows. Section 2 provides an overall design of the textual analysis – the Natural Language Processing (NLP) pipelines. Section 3 presents how the corpus is constructed and provides general insights using bag-of-words and tidy text approaches [@Silge]. Section 4 describes how text features such as polarity and formality are extracted and their effect on the stock price. Lastly, section 5 presents the analysis based on topic modelling – how topics are retrieved and the marginal effect of the topics on the stock price.
The data is collected from https://www.sec.gov/edgar.shtml, where 10-K forms – specifically the MD&A section – are retrieved for 30 companies from 2010 to 2020, depending on availability on EDGAR. Further explanation of data sampling and company selection criteria is provided in section 3. Natural Language Processing pipelines are then created, beginning with text cleaning and metadata retrieval. Next, lemmatization and POS-tagging are applied to the text data to reduce word duplication and enhance semantic coherence. The last part of section 3 provides a bag-of-words analysis using the TF-IDF approach for context understanding.
Moreover, the sentiment analysis in section 4 presents how to retrieve important features from the text data: polarity, formality, reading level, and stock price. This part discusses the proper dictionary to use in the financial context; a descriptive analysis of the relationship between the stock price and the other features; and the effect of these features on the stock price using regression.
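To make the dictionary-based polarity idea concrete before the full analysis, here is a minimal hand-rolled sketch: polarity as (positive hits − negative hits) divided by total tokens. The word lists below are purely illustrative, not the actual financial dictionaries used later.

```r
# toy dictionary polarity: (positive hits - negative hits) / total tokens
# (illustrative word lists, not the Loughran-McDonald or other dictionaries)
pos_words <- c("growth", "improved", "strong")
neg_words <- c("loss", "adverse", "decline")

polarity <- function(text) {
  tokens <- tolower(unlist(strsplit(text, "\\s+")))
  (sum(tokens %in% pos_words) - sum(tokens %in% neg_words)) / length(tokens)
}

polarity("strong growth despite adverse conditions")  # (2 - 1) / 5 = 0.2
```

The dictionaries used in section 4 follow the same counting principle, but with word lists tailored to financial text.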
Lastly, in section 5, the text data is processed and cleaned to construct a corpus for topic modelling, following an approach similar to [@Schmiedel], who demonstrated the advantage of structural topic modelling (STM). This section discusses the optimal number of topics and topic evaluation, including the relationship of features to topics using regression analysis. The approach is summarized in Figure 1: Summary of analysis approach.
Figure 1 Summary of analysis approach
This section presents how the data is downloaded from EDGAR – specifically the MD&A part of 10-K forms – and parsed into a data frame. It also illustrates the pipelines, starting with data cleaning, lemmatization, and POS-tagging. After that, the analysis is split into two routes, unigrams and bigrams, showing how the tokenization process is implemented and the insights based on the TF-IDF approach. The approach is summarized in Figure 2: Summary of corpus construction approach below.
Figure 2 Summary of corpus construction approach
The purpose of this project is to quantitatively predict the impact of sentiment and topics on the change in stock price. Therefore, the data for this analysis is collected from https://www.sec.gov/edgar.shtml, where 10-K forms of 30 companies from the S&P 500, specifically in the Information Technology sector (GICS), are randomly selected. An additional selection criterion is the number of companies in each sub-industry: the top 5 sub-industries with the highest number of companies are selected, namely Semiconductors (Semi Con); Data Processing & Outsourced Services (DP); Application Software (App); Technology Hardware, Storage & Peripherals (Tech Hardware); and IT Consulting & Other Services (IT Consult). The summary is presented in Figure 3: Summary of the number of companies in IT sector below.
For each sub-industry, companies are randomly selected, and the data, specifically the MD&A part, is collected from 2010 to 2020, where available. The list of companies, sub-industries, and sample sizes is summarized in Figure 4: Sample size of each company across sub industries below.
Figure 4 Sample size of each company across sub industries
# set up a user for EDGAR to prevent errors of using automated tools
Sys.setenv("User-Agent"="EDGARWEBR_USER_AGENT")
# create lists of companies across sub industries using CIK numbers
app_software = c("796343", "883241", "896878", "1108524", "1341439", "1013462")
dp_outsource = c("1175454","723531","1101215","1123360","1141391","1365135","1403161","1136893")
it_con = c("749251","1058290","1467373","1336920")
semi_con = c("2488","50863","743316","804328","827054","1045810","4127","1730168")
tech_hw = c("108772","320193","106040","1137789")
cik_list = c(app_software, dp_outsource, it_con, semi_con, tech_hw)
# retrieve indexes and metadata across companies in the cik list
edgar::getMasterIndex(2010:2020)
indexes = data.frame()
index_files = list.files("Master Indexes/", pattern="Rda")
for(i in index_files){
  load(paste0("Master Indexes/", i))
  this_index <- year.master
  indexes <- bind_rows(indexes, this_index)
  print(i)
}
# to cross check available data
indexes_2 = indexes %>% filter(cik %in% cik_list, form.type == "10-K")
# create a system sleep function to rate-limit requests to EDGAR
testit <- function(x){
  p1 <- proc.time()
  Sys.sleep(x)
  proc.time() - p1
}
for(i in cik_list){
  print(i)
  for(y in c(2010:2020)){
    print(y)
    getFilings(cik.no = i, filing.year = y, form.type = "10-K")
    testit(10)
  }
}
# create a function to retrieve only the management discussion (MD&A) part
getMgmtDisc_2 <- function(cik.no, filing.year) {
  f.type <- c("10-K", "10-K405", "10KSB", "10KSB40")
  # 10-K, 10-K405, 10-KSB, 10-KT, 10KSB, 10KSB40, and 10KT405 filings in the EDGAR database
  # check the year validity
  if (!is.numeric(filing.year)) {
    cat("Please check the input year.")
    return()
  }
  output <- getFilings(cik.no = cik.no, form.type = f.type, filing.year = filing.year,
                       quarter = c(1, 2, 3, 4), downl.permit = "y")
  if (is.null(output)) {
    cat("No annual statements found for given CIK(s) and year(s).")
    return()
  }
  cat("Extracting 'Item 7' section...\n")
  progress.bar <- txtProgressBar(min = 0, max = nrow(output), style = 3)
  # function for text cleaning
  CleanFiling2 <- function(text) {
    text <- gsub("[[:digit:]]+", "", text)  ## remove digits
    text <- gsub("\\s{1,}", " ", text)      ## collapse whitespace
    text <- gsub('\"', "", text)            ## remove double quotes
    #text <- RemoveStopWordsFilings(text)
    return(text)
  }
  new.dir <- paste0("MD&A section text")
  dir.create(new.dir)
  output$extract.status <- 0
  output$company.name <- toupper(as.character(output$company.name))
  output$company.name <- gsub("\\s{2,}", " ", output$company.name)
  for (i in 1:nrow(output)) {
    f.type <- gsub("/", "", output$form.type[i])
    cname <- gsub("\\s{2,}", " ", output$company.name[i])
    year <- output$filing.year[i]
    cik <- output$cik[i]
    date.filed <- output$date.filed[i]
    accession.number <- output$accession.number[i]
    dest.filename <- paste0("Edgar filings_full text/Form ", f.type,
                            "/", cik, "/", cik, "_", f.type, "_",
                            date.filed, "_", accession.number, ".txt")
    # read the filing
    filing.text <- readLines(dest.filename)
    # extract data from the first <DOCUMENT> to </DOCUMENT>
    tryCatch({
      filing.text <- filing.text[(grep("<DOCUMENT>", filing.text, ignore.case = TRUE)[1]):(grep("</DOCUMENT>",
                                 filing.text, ignore.case = TRUE)[1])]
    }, error = function(e) {
      filing.text <- filing.text  ## in case the opening and closing DOCUMENT tags are not found, consider the full web page
    })
    # see if the 10-K is in XBRL/HTML or the old text format
    if (any(grepl(pattern = "<xml>|<type>xml|<html>|10k.htm", filing.text, ignore.case = T))) {
      doc <- XML::htmlParse(filing.text, asText = TRUE, useInternalNodes = TRUE, addFinalizer = FALSE)
      f.text <- XML::xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]",
                                 XML::xmlValue)
      f.text <- iconv(f.text, "latin1", "ASCII", sub = " ")
      ## free up the htmlParse document to avoid memory leakage; this calls a C function
      #.Call('RS_XML_forceFreeDoc', doc, package = 'XML')
    } else {
      f.text <- filing.text
    }
    # preprocess the filing text
    f.text <- gsub("\\n|\\t|$", " ", f.text)
    f.text <- gsub("^\\s{1,}", "", f.text)
    f.text <- gsub(" s ", " ", f.text)
    # check for empty lines and delete them
    empty.lnumbers <- grep("^\\s*$", f.text)
    if (length(empty.lnumbers) > 0) {
      f.text <- f.text[-empty.lnumbers]  ## remove all lines containing only spaces
    }
    # get the MD&A section
    startline <- grep("^Item\\s{0,}7[^A]", f.text, ignore.case = TRUE)
    endline <- grep("^Item\\s{0,}7A", f.text, ignore.case = TRUE)
    # if there is no Item 7A, then take text up to Item 8
    if (length(endline) == 0) {
      endline <- grep("^Item\\s{0,}8", f.text, ignore.case = TRUE)
    }
    md.discussion <- NA
    if (length(startline) != 0 && length(endline) != 0) {
      startline <- startline[length(startline)]
      endline <- endline[length(endline)] - 1
      md.discussion <- paste(f.text[startline:endline], collapse = " ")
      md.discussion <- gsub("\\s{2,}", " ", md.discussion)
      #words.count <- str_count(md.discussion, pattern = "\\S+")
      #md.discussion <- gsub(" co\\.| inc\\.| ltd\\.| llc\\.| comp\\.", " ", md.discussion, ignore.case = T)
      #md.discussion2 <- unlist(strsplit(md.discussion, "\\. "))
      #md.discussion2 <- paste0(md.discussion2, ".")
      #md.discussion <- CleanFiling2(md.discussion)
      header <- paste0("CIK: ", cik, "\n", "Company Name: ", cname, "\n",
                       "Form Type : ", f.type, "\n", "Filing Date: ", date.filed, "\n",
                       "Accession Number: ", accession.number)
      md.discussion <- paste0(header, "\n\n\n", md.discussion)
    }
    if (!is.na(md.discussion)) {
      filename2 <- paste0(new.dir, '/', cik, "_", f.type, "_", date.filed,
                          "_", accession.number, ".txt")
      writeLines(md.discussion, filename2)
      output$extract.status[i] <- 1
    }
    # update the progress bar
    setTxtProgressBar(progress.bar, i)
  }
  ## convert dates into R dates
  output$date.filed <- as.Date(as.character(output$date.filed), "%Y-%m-%d")
  # close the progress bar
  close(progress.bar)
  output$quarter <- NULL
  output$filing.year <- NULL
  names(output)[names(output) == 'status'] <- 'downld.status'
  cat("MD&A section texts are stored in the 'MD&A section text' directory.")
  return(output)
}
# download the management discussion data
df = data.frame()
for(i in as.numeric(cik_list)){
  print(i)
  for(y in 2010:2020){
    print(y)
    data = getMgmtDisc_2(cik.no = i, filing.year = y)
    df = rbind(df, data)
    testit(10)
  }
}
# build the management discussion data frame from the saved text files
text_df = data.frame()
txt_files = list.files("MD&A section text/", pattern = "txt")
for(i in 1:length(txt_files)){
  test = readLines(paste0("MD&A section text/", txt_files[i]))
  cik = test[1]
  comp_name = test[2]
  form_type = test[3]
  filing_date = test[4]
  access_no = test[5]
  text = test[8]
  rows <- data.frame(cik, comp_name, form_type, filing_date, access_no, text)
  text_df <- bind_rows(text_df, rows)
  print(i)
}
# create a function extracting only numbers
numextract <- function(string){
  str_extract(string, "\\-*\\d+\\.*\\d*")
}
# create a function for text cleaning
cleanse_text = function(c) {
  c = tolower(c)
  c = iconv(c)
  c = gsub("[[:punct:][:blank:]]+", " ", c)
  c = stringr::str_replace_all(c, "\r", "")
  c = trimws(c)
  # remove mentions, urls, numbers, hashtags, and non-ASCII characters
  c <- gsub("@\\w+", "", c)
  c <- gsub("https?://.+", "", c)
  c <- gsub("\\d+\\w*\\d*", "", c)
  c <- gsub("#\\w+", "", c)
  c <- gsub("[^\x01-\x7F]", "", c)
  # remove extra spaces and newlines
  c <- gsub("\n", " ", c)
  c <- gsub("^\\s+", "", c)
  c <- gsub("\\s+$", "", c)
  c <- gsub("[ |\t]+", " ", c)
  return(c)
}
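As a quick sanity check on the number-extraction helper, here is an illustrative standalone example (the helper is redefined below so the snippet runs on its own; the input strings are invented):

```r
library(stringr)

# redefined from above so this snippet runs standalone
numextract <- function(string){
  str_extract(string, "\\-*\\d+\\.*\\d*")
}

numextract("Accession Number: 0001193125-10-012085")  # "0001193125" (first run of digits)
numextract("price -3.5 percent")                      # "-3.5" (sign and decimal kept)
```

Note that the regex only captures the first numeric run, which is sufficient here because the accession-number line contains the identifier first.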
text_df$cik = as.numeric(str_extract(text_df$cik,pattern="[0-9]+"))
text_df$comp_name = sub(".*:", "", text_df$comp_name)
text_df$comp_name = trimws(gsub('"', "", text_df$comp_name))
text_df$form_type = sub(".*:", "", text_df$form_type)
text_df$filing_date = as.Date(sub(".*:", "", text_df$filing_date))
text_df$access_no = numextract(text_df$access_no)
text_df$text = cleanse_text(text_df$text)
# read the ticker data
ticker = read.csv("ticker.txt",sep = " ", header = FALSE)
ticker$code = as.numeric(numextract(ticker$V1))
ticker$comp = sub("\\s+\\S+$", '', ticker$V1)
ticker = ticker[-1]
# join the data with the previous data frame
text_df2 = text_df %>% inner_join(ticker, by = c("cik" = "code"))
# feature engineering: create new columns: pre-filing and post-filing dates
text_df2$date_before <- text_df2$filing_date - 7
text_df2$date_after <- text_df2$filing_date + 5
text_df2$date_after_wd <- with(text_df2, date_after - setNames(rep(0:2, c(5, 1, 1)), 1:7)[format(date_after, "%u")]) # force the post-filing date to be only weekdays
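The weekday adjustment above is compact, so here is the same lookup-table trick as a standalone illustration (the dates are arbitrary examples, not from the data set):

```r
# map ISO weekday numbers 1..7 (Mon..Sun) to offsets 0,0,0,0,0,1,2 so that
# Saturday and Sunday dates are pulled back to the preceding Friday
shift_to_weekday <- function(d) {
  offsets <- setNames(rep(0:2, c(5, 1, 1)), 1:7)
  d - offsets[format(d, "%u")]
}

shift_to_weekday(as.Date("2021-06-05"))  # Saturday  -> "2021-06-04" (Friday)
shift_to_weekday(as.Date("2021-06-06"))  # Sunday    -> "2021-06-04" (Friday)
shift_to_weekday(as.Date("2021-06-02"))  # Wednesday -> unchanged
```

This matters because stock prices are only quoted on trading days, so a post-filing date landing on a weekend would have no matching closing price.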
# after checking the data, the company with CIK 1730168 is duplicated (avgo vs avgop)
# hence, it is removed
text_df2 = text_df2 %>% filter(!comp == "avgop")
# save the file
saveRDS(text_df2, file = "text_df2.rds")
langmodel_download <- udpipe::udpipe_download_model("english")
langmodel <- udpipe::udpipe_load_model(langmodel_download$file_model)
postagged <- udpipe_annotate(langmodel,
text_df2$text,
parallel.cores = 10,
trace = 100)
# create a data frame of pos-tagged texts
postagged <- as.data.frame(postagged)
head(postagged)
# save the working file as .rds
saveRDS(postagged,file = "postagged.rds")
# read data
postagged = readRDS("postagged.rds")
# filter only nouns, adj, and adv
lematized <- postagged %>%
  filter(upos %in% c("NOUN", "ADJ", "ADV")) %>%
  select(doc_id, lemma) %>%
  group_by(doc_id) %>%
  summarise(documents_pos_tagged = paste(lemma, collapse = " "))
## `summarise()` ungrouping output (override with `.groups` argument)
# create unique id
text_df2 <- text_df2 %>% mutate(doc_id = paste0("doc",row_number()))
# combine tables
text_df3 <- text_df2 %>% left_join(lematized)
## Joining, by = "doc_id"
# save the working file as .rds
saveRDS(text_df3,file = "text_df3.rds")
# check NA values across columns
which(colSums(is.na(text_df3)) > 0)
## documents_pos_tagged
## 12
# word tokenization
token_df_all <- text_df3 %>% unnest_tokens(word,documents_pos_tagged)
# check the language and remove non-English words
token_df_all$lan <- cld3::detect_language(token_df_all$word)
token_df_all <- token_df_all %>% filter(lan=="en")
# check the language using different approach: hunspell
spell_check = hunspell_check(token_df_all$word)
token_df_all = token_df_all[spell_check,]
token_df_all = token_df_all %>% select(-text)
saveRDS(token_df_all,file = "token_df_all.rds")
# read data
token_df_all = readRDS("token_df_all.rds")
# calculate the token length
token_df_all$token_length <- nchar(token_df_all$word)
# let's have a look on the distribution
token_df_all %>%
  group_by(token_length) %>%
  summarise(total = n()) %>%
  ggplot(aes(x = token_length, y = total)) +
  geom_col() +
  labs(x = "Token length", y = "Frequency", subtitle = "Distribution of Token Length")
## `summarise()` ungrouping output (override with `.groups` argument)
# let's have a look at the distribution of tokens again
token_df_all %>% group_by(token_length) %>%
summarise(total =n()) %>%
arrange(desc(token_length))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 16 x 2
## token_length total
## <int> <int>
## 1 18 2
## 2 17 7
## 3 15 32
## 4 14 214
## 5 13 2241
## 6 12 1646
## 7 11 9274
## 8 10 14334
## 9 9 8488
## 10 8 27454
## 11 7 35013
## 12 6 27779
## 13 5 22522
## 14 4 24541
## 15 3 4331
## 16 2 491
outliers1 <- boxplot(token_df_all$token_length, ylab = "Token length")$out
# drop the rows containing outliers
token_df_all <- token_df_all[!token_df_all$token_length %in% outliers1,]
# inspect top frequent words
token_df_all %>% count(word, sort=T)
# remove null values in words
token_df_all = token_df_all[!is.na(token_df_all$word),]
# load stop words
data("stop_words")
token_df2 = token_df_all %>% anti_join(stop_words, by = "word")
token_df2 %>% count(word, sort=T)
# generate new stop words
mystopwords <- tibble(word = c(
  "fiscal", "other", "related", "business", "rate", "product", "note", "payment",
  "table", "contract", "foreign", "currency", "such", "charge", "time", "purchase",
  "end", "research", "support", "policy", "available", "party", "lease",
  "requirement", "recognition", "most", "significant", "proceed", "solutions",
  "excess", "issuance", "connection", "government", "day", "act", "strategy",
  "state", "difference", "march", "country", "experience", "consulting", "action",
  "page", "NA", "revenue", "income", "tax", "cash", "cost", "net", "financial",
  "net", "asset", "expense", "result", "total", "share", "service", "company",
  "increase", "management", "sale"
))
# combine data sets
token_df2 = token_df2 %>% anti_join(mystopwords, by = "word")
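For readers less familiar with the `anti_join()` semantics used for stop-word removal, a toy example (the words are invented; assumes `dplyr`):

```r
library(dplyr)

tokens <- tibble(word = c("revenue", "channel", "fiscal", "delivery"))
custom <- tibble(word = c("fiscal", "revenue"))

# anti_join keeps only the rows of the left table with no match in the right table
kept <- anti_join(tokens, custom, by = "word")$word
kept  # "channel" "delivery"
```

The same call removes both the standard `stop_words` list and the custom financial stop words above, without disturbing the order of the remaining tokens.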
# explore top words across corpus
token_df2 %>% group_by(word) %>%
summarise(total =n()) %>%
arrange(desc(total)) %>%
top_n(10)
## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by total
## # A tibble: 10 x 2
## word total
## <chr> <int>
## 1 intangible 2789
## 2 potential 991
## 3 applicable 814
## 4 materially 760
## 5 commercial 619
## 6 comparable 609
## 7 channel 607
## 8 ongoing 596
## 9 collection 586
## 10 delivery 576
# save file
saveRDS(token_df2,file = "token_df2.rds")
listing_words = token_df2 %>% count(doc_id, word, sort=T)
total_words <- listing_words %>%
group_by(doc_id) %>%
summarize(total = sum(n))
## `summarise()` ungrouping output (override with `.groups` argument)
listing_words <- listing_words %>%
left_join(total_words)
## Joining, by = "doc_id"
listing_tf_idf <- listing_words %>% bind_tf_idf(word,doc_id,n) %>% select(-total)
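Since `bind_tf_idf()` does the heavy lifting here, a tiny standalone example of its behavior may help (two toy documents; `tidytext` uses a natural-log idf):

```r
library(dplyr)
library(tidytext)

# two toy documents; tf-idf up-weights words unique to one document
toy <- tibble(doc  = c("d1", "d1", "d1", "d2", "d2"),
              word = c("apple", "banana", "apple", "banana", "cherry"))

res <- toy %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n)
res
# "apple" occurs only in d1: tf = 2/3, idf = log(2/1), tf_idf = (2/3) * log(2)
# "banana" occurs in both documents, so its idf (and hence tf_idf) is 0
```

Words shared by every document therefore receive a tf-idf of zero, which is exactly why tf-idf surfaces document-specific vocabulary in the corpus.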
# plot a histogram
hist(listing_tf_idf$tf_idf,breaks = 80,main="TF-IDF plot", xlim = c(0,1))
listing_tf_idf <- listing_tf_idf %>%
filter(tf_idf<=0.2)
# plot a histogram
hist(listing_tf_idf$tf_idf,breaks = 80,main="TF-IDF plot", xlim = c(0,0.2))
listing_tf_idf_2 <- listing_tf_idf %>%
filter(tf_idf<=0.2)
# statistical summary
summary(listing_tf_idf_2$tf_idf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002382 0.0040371 0.0077939 0.0115862 0.0141520 0.1993487
# let's explore the top 10 words with the highest tf-idf
listing_tf_idf_2 %>%
group_by(word) %>%
arrange(desc(tf_idf)) %>%
top_n(10)
## Selecting by tf_idf
## # A tibble: 3,731 x 6
## # Groups: word [577]
## doc_id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 doc52 absolute 11 0.0780 2.56 0.199
## 2 doc99 contraction 18 0.0769 2.49 0.192
## 3 doc67 spin 10 0.0763 2.49 0.190
## 4 doc83 discover 9 0.0570 3.25 0.185
## 5 doc18 inference 3 0.0395 4.63 0.183
## 6 doc100 contraction 19 0.0706 2.49 0.176
## 7 doc129 booking 12 0.0764 2.28 0.175
## 8 doc101 contraction 20 0.0692 2.49 0.173
## 9 doc85 acquirer 12 0.0732 2.33 0.171
## 10 doc128 booking 12 0.0745 2.28 0.170
## # ... with 3,721 more rows
# let's explore the top 10 words with the lowest tf-idf
listing_tf_idf_2 %>%
group_by(word) %>%
arrange(tf_idf) %>%
top_n(10)
## Selecting by tf_idf
## # A tibble: 3,731 x 6
## # Groups: word [577]
## doc_id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 doc15 conjunction 2 0.0198 0.157 0.00312
## 2 doc91 conjunction 2 0.0213 0.157 0.00335
## 3 doc1 conjunction 3 0.0224 0.157 0.00352
## 4 doc175 materially 3 0.0469 0.0756 0.00354
## 5 doc70 conjunction 3 0.0227 0.157 0.00358
## 6 doc170 sufficient 2 0.025 0.146 0.00365
## 7 doc148 sufficient 2 0.0253 0.146 0.00370
## 8 doc149 sufficient 2 0.0256 0.146 0.00375
## 9 doc185 materially 11 0.0505 0.0756 0.00381
## 10 doc81 sufficient 4 0.0261 0.146 0.00382
## # ... with 3,721 more rows
outliers1.1 <- boxplot(listing_tf_idf_2$tf_idf, ylab = "TF-IDF")$out
summary(listing_tf_idf_2$tf_idf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002382 0.0040371 0.0077939 0.0115862 0.0141520 0.1993487
# drop the rows containing outliers
listing_tf_idf_3 <- listing_tf_idf_2 %>% filter(tf_idf <= 0.1 & tf_idf >= 0.0002)
listing_tf_idf_3 %>%
group_by(word) %>%
arrange(desc(tf_idf)) %>%
top_n(10)
## Selecting by tf_idf
## # A tibble: 3,714 x 6
## # Groups: word [577]
## doc_id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 doc171 discontinued 5 0.0617 1.61 0.0996
## 2 doc82 limited 13 0.129 0.774 0.0996
## 3 doc146 unspecified 7 0.0530 1.86 0.0987
## 4 doc13 game 9 0.0446 2.19 0.0977
## 5 doc123 converted 14 0.0287 3.38 0.0972
## 6 doc30 petition 4 0.0284 3.38 0.0959
## 7 doc181 conference 3 0.0333 2.84 0.0948
## 8 doc102 unfunded 8 0.0491 1.93 0.0946
## 9 doc145 unspecified 7 0.0504 1.86 0.0938
## 10 doc188 stable 11 0.0512 1.83 0.0937
## # ... with 3,704 more rows
token_df3 = token_df2 %>% semi_join(listing_tf_idf_3, by = c("doc_id","word"))
# save file
saveRDS(token_df3,file = "token_df3.rds")
token_df3 = token_df3 %>%
mutate(industry=case_when(
cik %in% as.numeric(dp_outsource) ~ "Data Processing & Outsourced Services",
cik %in% as.numeric(app_software) ~ "Application Software",
cik %in% as.numeric(it_con) ~ "IT Consulting & Other Services",
cik %in% as.numeric(semi_con) ~ "Semiconductors",
cik %in% as.numeric(tech_hw) ~ "Technology Hardware, Storage & Peripherals"
)) %>%
mutate(industry=industry %>% as.factor())
# word counts
industry_words <- token_df3 %>%
count(industry, word, sort = TRUE)
# total word counts
total_industry_words <- industry_words %>%
group_by(industry) %>%
summarize(total = sum(n))
# left join
industry_words <- industry_words %>%
left_join(total_industry_words)
# visualization of the frequency
industry_words %>%
mutate(tf = n/total) %>%
ggplot(aes(x=tf,fill=industry)) +
geom_histogram(show.legend = FALSE)+
facet_wrap(~industry,ncol=3,scales = "free_y")
From the figure above, we can see that low-frequency words are the most common across sub-industries.
Let’s apply Zipf’s law
industry_words %>%
group_by(industry) %>%
mutate(rank = row_number(),
tf = n/total) %>%
ungroup() -> zipf_data
# plot using Zipf's law
zipf_data %>%
ggplot(aes(rank, tf, color = industry)) +
geom_line(size = 1.1, alpha = 0.8, show.legend = TRUE) +
scale_x_log10() +
scale_y_log10()
Notice that significant deviation among sub-industries at the low ranks is uncommon here. The curves also indicate that the MD&A discussions use a lower percentage of the most common words than many other collections of language.
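For reference, under an ideal Zipf distribution term frequency is proportional to 1/rank, so the slope on a log–log scale is -1; a toy fit on synthetic data (not the MD&A corpus) recovers this exactly:

```r
# synthetic, perfectly Zipfian frequencies: tf = 1 / rank
rank <- 1:1000
tf <- 1 / rank

# regress log10(tf) on log10(rank); the fitted slope is the Zipf exponent
fit <- lm(log10(tf) ~ log10(rank))
coef(fit)[2]  # slope = -1 on this synthetic data
```

Comparing the empirical curves above against this -1 benchmark is one way to quantify how far each sub-industry's vocabulary deviates from a classic Zipf profile.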
Inspect the distribution of TF-IDF values.
# calculating TF-IDF scores
industry_tf_idf <- industry_words %>%
bind_tf_idf(word,industry,n) %>%
select(-total)
options(scipen = 1000)
# tf-idf frequency
industry_tf_idf %>% ggplot(aes(tf_idf)) + geom_histogram(bins = 50)
# check the statistical summary
summary(industry_tf_idf$tf_idf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000000 0.00000000 0.00005751 0.00029156 0.00034059 0.01073434
# create our own function to wrap long facet labels
swr = function(string, nwrap = 20) {
  paste(strwrap(string, width = nwrap), collapse = "\n")
}
swr = Vectorize(swr)
industry_tf_idf %>%
group_by(industry) %>%
slice_max(tf_idf,n=10,with_ties = F) %>%
ungroup() %>%
mutate(industry = swr(industry), row = row_number(), word2 = fct_reorder(word, tf_idf)) -> n
n %>%
ggplot(aes(tf_idf, word2, fill = industry)) +
geom_col(show.legend = FALSE) +
facet_wrap(~industry, ncol = 3, scales = "free") +
labs(x = "tf-idf", y = "Top 10 Words")
token_df_bi <- text_df3 %>%
unnest_tokens(bigram, documents_pos_tagged, token="ngrams",n=2) %>%
separate(bigram, c("word1","word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word1 %in% mystopwords$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word2 %in% mystopwords$word) %>%
unite(bigram, word1, word2, sep = " ")
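A toy example of the bigram tokenization step above (the sentence is invented; assumes `tidytext` and `dplyr`):

```r
library(dplyr)
library(tidytext)

# a four-word document yields three overlapping bigrams
bi_toy <- tibble(txt = "net revenue growth rate") %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)
bi_toy$bigram  # "net revenue" "revenue growth" "growth rate"
```

In the pipeline above, each bigram is then split into `word1`/`word2` so stop words can be filtered on either position before the pair is reunited.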
# check the language again and remove non-English words
token_df_bi$lan <- cld3::detect_language(token_df_bi$bigram)
token_df_bi <- token_df_bi %>% filter(lan=="en")
# remove unused features
token_df_bi %>% select(-text,-lan) -> token_df_bi
# save file
saveRDS(token_df_bi,file = "token_df_bi.rds")
# top 20 words
token_df_bi %>%
count(bigram) %>%
mutate(bigram=reorder(bigram,n)) %>%
slice_max(n,n=20) %>%
ggplot(aes(bigram,n)) +
geom_bar(stat = "identity") +
xlab(NULL) +
coord_flip()
# calculating TF-IDF scores
bi_words = token_df_bi %>% count(doc_id, bigram, sort=T)
total_words <- bi_words %>%
group_by(doc_id) %>%
dplyr::summarize(total = sum(n))
## `summarise()` ungrouping output (override with `.groups` argument)
# joining data
bi_words <- bi_words %>%
left_join(total_words)
## Joining, by = "doc_id"
bi_tf_idf <- bi_words %>% bind_tf_idf(bigram,doc_id,n) %>% select(-total)
# let's explore the top 10 words with the lowest tf-idf
bi_tf_idf %>%
group_by(bigram) %>%
arrange(desc(tf_idf)) %>%
top_n(10)
## Selecting by tf_idf
## # A tibble: 42,312 x 6
## # Groups: bigram [14,053]
## doc_id bigram n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 doc119 material adverse 1 1 1.11 1.11
## 2 doc120 material adverse 1 1 1.11 1.11
## 3 doc82 public limited 12 0.096 4.27 0.410
## 4 doc43 december comparable 31 0.0891 3.58 0.319
## 5 doc105 software subscription 73 0.0996 2.97 0.296
## 6 doc103 holding annual 16 0.0669 3.98 0.267
## 7 doc20 stock indian 20 0.0604 4.27 0.258
## 8 doc19 stock indian 24 0.0578 4.27 0.247
## 9 doc98 science application 19 0.0648 3.76 0.244
## 10 doc102 holding annual 18 0.0596 3.98 0.237
## # ... with 42,302 more rows
outliers1.2 <- boxplot(bi_tf_idf$tf_idf, ylab = "TF-IDF")$out
# check statistical summary
summary(bi_tf_idf$tf_idf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000118 0.0078163 0.0115818 0.0136100 0.0164240 1.1079582
# drop the rows containing outliers
bi_tf_idf_2 <- bi_tf_idf %>% filter(tf_idf <= 0.2 & tf_idf >= 0.00002)
bi_tf_idf_2 %>%
group_by(bigram) %>%
arrange(desc(tf_idf)) %>%
top_n(10)
## Selecting by tf_idf
## # A tibble: 42,301 x 6
## # Groups: bigram [14,050]
## doc_id bigram n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 doc99 science application 17 0.0514 3.76 0.193
## 2 doc176 month month 15 0.0758 2.54 0.192
## 3 doc70 information security 19 0.0688 2.73 0.188
## 4 doc177 month month 14 0.0722 2.54 0.183
## 5 doc89 impact special 8 0.0408 3.98 0.163
## 6 doc99 growth contraction 13 0.0393 3.98 0.156
## 7 doc137 waiver exclusivity 10 0.0365 4.27 0.156
## 8 doc70 cyber incident 8 0.0290 5.37 0.156
## 9 doc187 interactive solution 11 0.0330 4.68 0.155
## 10 doc100 growth contraction 15 0.0376 3.98 0.150
## # ... with 42,291 more rows
# joining data tables
token_df_bi2 = token_df_bi %>% semi_join(bi_tf_idf_2, by = c("doc_id","bigram"))
# save file
saveRDS(token_df_bi2,file = "token_df_bi2.rds")
# create a new column of sub industry
token_df_bi2 = token_df_bi2 %>%
mutate(industry=case_when(
cik %in% as.numeric(dp_outsource) ~ "Data Processing & Outsourced Services",
cik %in% as.numeric(app_software) ~ "Application Software",
cik %in% as.numeric(it_con) ~ "IT Consulting & Other Services",
cik %in% as.numeric(semi_con) ~ "Semiconductors",
cik %in% as.numeric(tech_hw) ~ "Technology Hardware, Storage & Peripherals"
)) %>%
mutate(industry=industry %>% as.factor())
# word counts
industry_bi <- token_df_bi2 %>%
count(industry, bigram, sort = TRUE)
# total word counts
total_industry_bi <- industry_bi %>%
group_by(industry) %>%
summarize(total = sum(n))
# left join
industry_bi <- industry_bi %>%
left_join(total_industry_bi)
# TF-IDF calculation
industry_tf_idf_bi <- industry_bi %>%
bind_tf_idf(bigram,industry,n) %>%
select(-total)
options(scipen = 100)
industry_tf_idf_bi %>%
group_by(industry) %>%
slice_max(tf_idf,n=10,with_ties = F) %>%
ungroup() %>%
mutate(industry = swr(industry), row = row_number(), word2 = fct_reorder(bigram, tf_idf)) -> n
n %>%
ggplot(aes(tf_idf, word2, fill = industry)) +
geom_col(show.legend = FALSE) +
facet_wrap(~industry, ncol = 3, scales = "free") +
labs(x = "tf-idf", y = "Top 10 Words")
# remove unused files
remove(postagged)
remove(token_df_all)
remove(token_df_bi)
remove(token_df2)
remove(bi_tf_idf)
remove(bi_words)
This section illustrates how text-derived features, especially polarity, are related to the stock price difference. [@Mckay] found that quarterly conference call tone is a significant estimator of unusual stock returns. Furthermore, [@Griffin] showed that investors respond to EDGAR 10-K filings within a short period (0–2 days) once the filings become available. However, we would like to see a longer-term effect. Hence, we use a time window of 7 days, i.e., a week, and inspect whether the filings have a similar effect.
The approach starts with retrieving the closing stock prices for the pre-filing and post-filing dates. We then create a new feature called the stock price difference, defined as the difference between two time points: a) the pre-filing date, 7 days before the filing date (the date the 10-K form is submitted on EDGAR), and b) the post-filing date, 5 days after the filing date. Next, text-related variables are extracted: sentiment, polarity, formality, and readability. Lastly, to inspect the relationship and the effect on the price difference, several regression models are implemented: fixed effects, random effects, and multiple linear regression. The approach is summarized in the figure below.
Figure 5 Summary of sentiment analysis approach
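A minimal sketch of the price-difference construction, using toy prices rather than real market data (the actual report retrieves closing prices from a market data source):

```r
# toy daily closing prices around a filing date (invented values)
prices <- data.frame(
  date  = as.Date("2020-03-02") + 0:12,
  close = c(100, 101, 99, 98, 102, 103, 104, 101, 100, 105, 106, 107, 108)
)
filing_date <- as.Date("2020-03-09")

pre  <- prices$close[prices$date == filing_date - 7]  # close 7 days before filing
post <- prices$close[prices$date == filing_date + 5]  # close 5 days after filing
price_diff <- post - pre
price_diff  # 108 - 100 = 8
```

In the real pipeline, the post-filing date is first shifted to a weekday (see the `date_after_wd` feature in section 3) so that a matching closing price always exists.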
cor(text_df7[, c("SentimentLM", "SentimentHE", "SentimentQDAP", "SentimentGI")])
## SentimentLM SentimentHE SentimentQDAP SentimentGI
## SentimentLM 1.0000000 0.11639660 0.36864027 0.4869805
## SentimentHE 0.1163966 1.00000000 -0.08281281 -0.2691861
## SentimentQDAP 0.3686403 -0.08281281 1.00000000 0.8110296
## SentimentGI 0.4869805 -0.26918607 0.81102960 1.0000000
text_df7 %>% pivot_longer(cols = c("SentimentGI","SentimentLM","SentimentQDAP","SentimentHE"), names_to = "dict", values_to = "sentiment") %>%
ggplot(aes(x = as.numeric(year), y = sentiment, group=dict, colour = dict)) +
geom_point(show.legend = F) +
geom_smooth() +
facet_wrap(~dict, ncol = 1, scales = "free_y")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
text_df7 %>% pivot_longer(cols = c("SentimentGI","SentimentLM","SentimentQDAP","SentimentHE"), names_to = "dict", values_to = "sentiment") %>%
ggplot(aes(x = as.numeric(year), y = sentiment, group=dict, colour = dict)) +
geom_point(show.legend = F) +
geom_smooth() +
facet_wrap(~industry, ncol = 2, scales = "free_y")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# create a new data frame retrieving polarity values
sent_df = tibble()
for (i in 1:nrow(text_df7)){
  print(i)
  t = text_df7$text[i] %>% get_sentences() %>% sentimentr::sentiment_by()
  sent_df = bind_rows(sent_df, t)
}
sent_df = sent_df %>% mutate(doc_id = paste0("doc",row_number()))
# joining data
text_df8 = text_df7 %>% left_join(sent_df) %>% select(-sd,-element_id)
## Joining, by = "doc_id"
cor(text_df8[, c("SentimentLM", "SentimentHE", "SentimentQDAP", "SentimentGI", "ave_sentiment")])
## SentimentLM SentimentHE SentimentQDAP SentimentGI ave_sentiment
## SentimentLM 1.0000000 0.11639660 0.36864027 0.4869805 -0.10061405
## SentimentHE 0.1163966 1.00000000 -0.08281281 -0.2691861 0.47655258
## SentimentQDAP 0.3686403 -0.08281281 1.00000000 0.8110296 -0.04365885
## SentimentGI 0.4869805 -0.26918607 0.81102960 1.0000000 -0.22270385
## ave_sentiment -0.1006140 0.47655258 -0.04365885 -0.2227038 1.00000000
# box plot
text_df8 %>% ggplot(aes(x = industry, y = ave_sentiment, color = industry)) + labs(x="Industry", y="Polarity") +
geom_boxplot() + scale_x_discrete(labels = NULL, breaks = NULL)
text_df8 %>%
filter(!year == "2020") %>%
ggplot(aes(x = as.numeric(year), y = ave_sentiment)) +
geom_point(show.legend = F) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
text_df8 %>%
group_by(year) %>%
summarize(sent_avg = mean(ave_sentiment)) %>%
filter(!year == "2020") %>% ggplot(aes(x=as.numeric(year),y=sent_avg)) + geom_line() + xlab("Year") + ylab("Average Polarity")
## `summarise()` ungrouping output (override with `.groups` argument)
text_df8 %>%
filter(!year == "2020") %>%
ggplot(aes(x = as.numeric(year), y = ave_sentiment, group=industry, colour=industry)) +
geom_point(show.legend = F) +
geom_smooth() +
facet_wrap(~industry, ncol = 2, scales = "free_y") +
xlab("Year") + ylab("Average Polarity")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# average polarity by year and industry
text_df8 %>%
group_by(year,industry) %>%
summarize(sent_avg = mean(ave_sentiment)) %>%
filter(!year == "2020") %>% ggplot(aes(x=as.numeric(year),y=sent_avg, group=industry, colour = industry)) + geom_line() + xlab("Year") + ylab("Average Polarity")
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
# new data frame
text_df9 = text_df8 %>% select(-text,-documents_pos_tagged, -WordCount,-word.count,-cik,-NegativityGI,-NegativityHE,-NegativityLM,-NegativityQDAP,-PositivityGI,-PositivityHE,-PositivityLM,-PositivityQDAP,-RatioUncertaintyLM)
# filter only numerical features
text_df9 %>% select_if(is.numeric) -> df_numeric
correlate(df_numeric) -> cordata
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
# explore variables whose absolute correlation with sp_diff exceeds 0.1
cordata %>% select(term, sp_diff) %>% filter(sp_diff > 0.1 | sp_diff < -0.1) %>% arrange(desc(sp_diff))
## # A tibble: 0 x 2
## # ... with 2 variables: term <chr>, sp_diff <dbl>
# non-numerical data set
df_etc = text_df9 %>% select_if(negate(is.numeric)) %>% select(-comp_name,-form_type,-access_no,-doc_id,-filing_date,-date_before,-date_after,-date_after_wd) %>% mutate(comp = as.factor(comp), year = year(as.Date(as.character(year), format = "%Y")))
# numerical data set
df_numeric2 = df_numeric %>% select(-sentence.count,-sp_before,-sp_after)
preproc2 <- preProcess(df_numeric2, method=c("range"))
norm2 <- predict(preproc2, df_numeric2)
df_reg = cbind(norm2, df_etc)
# explore the distribution
hist(df_reg$formality)
hist(df_reg$sp_diff)
# relationship
df_reg %>% select(formality,sp_diff) %>% na.omit() %>% ggplot(aes(x=formality,y= sp_diff))+geom_smooth(method="loess",se = F)+ geom_point() + labs(x="Formality", y="Price difference", subtitle="Price Difference and Formality Relationship")
## `geom_smooth()` using formula 'y ~ x'
hist(df_reg$FK_grd.lvl)
# check the statistical summary of FK_grd.lvl
df_reg$FK_grd.lvl %>% summary(.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2717 0.3437 0.3557 0.4076 1.0000
# plot reading grade level against the price diff
df_reg %>% select(FK_grd.lvl, sp_diff) %>% na.omit() %>% ggplot(aes(x=FK_grd.lvl,y= sp_diff))+geom_smooth(method="loess",se = F)+geom_point() + labs(x="Reading grade level", y="Price difference", subtitle="Price Difference and Reading Grade Level Relationship")
# price diff vs FK reading ease
df_reg %>% select(FK_read.ease,sp_diff) %>% na.omit() %>% ggplot(aes(x=FK_read.ease,y= sp_diff))+geom_smooth(method="loess",se = F)+geom_point() + labs(x="Reading ease", y="Price difference", subtitle="Price Difference and Reading Ease Relationship")
# price diff vs polarity
df_reg %>% filter(!year == "2020") %>% ggplot(aes(x=ave_sentiment,y=sp_diff)) + geom_point() + stat_smooth() + labs(x="Polarity", y = "Price Difference")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_reg %>% filter(!year == "2020") %>% mutate(year = as.numeric(as.character(year))) %>% ggplot(aes(x=year,y=sp_diff)) + geom_point() + stat_smooth() + labs(x="Year", y = "Price Difference")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_reg %>% filter(!year == "2020") %>% mutate(year = as.numeric(as.character(year))) %>% ggplot(aes(x=industry,y=sp_diff,color = industry)) + geom_boxplot() + labs(x="Industry", y = "Price Difference") + scale_x_discrete(labels = NULL, breaks = NULL)
df_reg %>% count(year, sort=T)
df_reg = df_reg %>% filter(!year == '2020')
# convert 'year' to a factor in the original data
df_reg = df_reg %>% mutate(year = as.factor(year))
# create panel data frame
df_reg2 = pdata.frame(df_reg, index = c("comp", "year"))
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## The following object is masked from 'package:NCmisc':
##
## Dim
## This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
##
## Attaching package: 'mgcv'
## The following object is masked from 'package:stm':
##
## s
# Non-linear model
model0 <- gam(sp_diff ~ s(ave_sentiment) + industry + year + formality + FK_read.ease, data = df_reg)
# OLS model
model01 <- lm(sp_diff ~ ave_sentiment + industry + year + formality + FK_read.ease, data = df_reg)
# fixed effect model
model1 <- plm(sp_diff ~ ave_sentiment + industry + formality + FK_read.ease, data = df_reg2,
index = c("comp", "year"),
effect = "twoways", model = "within")
# random effect model
model2 <- plm(sp_diff ~ ave_sentiment + industry + formality + FK_read.ease, data = df_reg2,
index = c("comp", "year"),
effect = "twoways", model = "random")
# pooled model
model3 <- plm(sp_diff ~ ave_sentiment + industry + formality + FK_read.ease, data = df_reg2,
index = c("comp", "year"),
effect = "twoways", model = "pooling")
# test fixed-effect vs random effect models
phtest(model1,model2) # p-value > 0.05, use random effect
##
## Hausman Test
##
## data: sp_diff ~ ave_sentiment + industry + formality + FK_read.ease
## chisq = 0.0047838, df = 3, p-value = 0.9999
## alternative hypothesis: one model is inconsistent
# Breusch-Pagan Lagrange multiplier test
plmtest(model3, type=c("bp")) # p-value >0.05, use OLS
##
## Lagrange Multiplier Test - (Breusch-Pagan) for unbalanced panels
##
## data: sp_diff ~ ave_sentiment + industry + formality + FK_read.ease
## chisq = 3.2726, df = 1, p-value = 0.07044
## alternative hypothesis: significant effects
model4 <- lm(sp_diff ~ ave_sentiment + industry + year + formality, data = df_reg)
model5 <- lm(sp_diff ~ ave_sentiment + industry + year, data = df_reg)
model6 <- lm(sp_diff ~ ave_sentiment + industry, data = df_reg)
stargazer::stargazer(model01,model4,model5,model6,type = "text", single.row = TRUE, # to put coefficients and standard errors on same line
no.space = TRUE, # to remove the spaces after each line of coefficients
omit.stat=c("f", "ser"),
digits = 2,
column.sep.width = "0pt",
font.size = "tiny", # to make font size smaller
out = "modelresult0.html",
covariate.labels=c("sentiment","DP","IT Consult","Semi Con","Tech Hardware",
"2011","2012","2013","2014","2015","2016","2017","2018","2019","formality","reading ease","Intercept"))
Figure 6 Summary of regression models
This section shows how to construct the pipeline for topic modelling. It starts with data preparation and corpus construction. Next, the optimal number of topics is searched for, using semantic coherence and exclusivity as selection metrics. The structural topic model (stm) is then fitted with the chosen number of topics, and the resulting topic-document and topic-word associations are analyzed. In addition, marginal effects are estimated to examine the relationship between topics and related covariates such as the price difference, polarity, and year. Lastly, multiple linear regression models are implemented to assess the effect of the topics on the stock price difference. The approach is summarized in the figure below.
## Data preparation
postagged = readRDS("postagged.rds")
# keep only content words (nouns, adjectives, adverbs)
lematized <- postagged %>% filter(upos %in% c("NOUN",
"ADJ",
"ADV"))
saveRDS(lematized, file = "lematized.rds")
# read file
lematized = readRDS("lematized.rds")
# explore top words across corpus
lematized %>% group_by(lemma) %>%
summarise(total =n()) %>%
arrange(desc(total)) %>%
top_n(30)
## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by total
## # A tibble: 30 x 2
## lemma total
## <chr> <int>
## 1 revenue 20871
## 2 fiscal 16650
## 3 tax 16181
## 4 year 15614
## 5 income 12510
## 6 cash 12052
## 7 expense 11482
## 8 net 11348
## 9 other 10732
## 10 service 9629
## # ... with 20 more rows
# create new data frame
lematized_2 = lematized %>% count(doc_id, lemma, sort=T)
lematized_2_total <- lematized_2 %>%
group_by(doc_id) %>%
summarize(total = sum(n))
## `summarise()` ungrouping output (override with `.groups` argument)
lematized_2 <- lematized_2 %>%
left_join(lematized_2_total)
## Joining, by = "doc_id"
# save file
saveRDS(lematized_2, file = "lematized_2.rds")
# TF-IDF calculation
lem_tf_idf <- lematized_2 %>% bind_tf_idf(lemma,doc_id,n) %>% select(-total)
# plot a histogram
hist(lem_tf_idf$tf_idf,breaks = 50,main="TF-IDF plot", xlim = c(0,1))
# statistical summary
summary(lem_tf_idf$tf_idf) # heavily right-skewed: the median is only ~0.00027 while the maximum is ~0.42, so outliers are present
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0001357 0.0002704 0.0005544 0.0005293 0.4229301
# let's filter the data
lem_tf_idf_2 <- lem_tf_idf %>%
filter(tf_idf<=0.005 & tf_idf>=0.0001)
hist(lem_tf_idf_2$tf_idf,breaks = 20,main="TF-IDF plot", xlim = c(0,0.005))
# let's explore the top 30 words
lem_tf_idf_2 %>%
group_by(lemma) %>%
arrange(desc(tf_idf)) %>%
top_n(30)
## Selecting by tf_idf
## # A tibble: 76,738 x 6
## # Groups: lemma [7,042]
## doc_id lemma n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 doc108 big 22 0.00178 2.81 0.00500
## 2 doc123 escrow 33 0.00236 2.11 0.00499
## 3 doc104 fiscal 282 0.0257 0.194 0.00499
## 4 doc199 wafer 19 0.00342 1.46 0.00499
## 5 doc203 atmel 7 0.00125 3.98 0.00499
## 6 doc203 issc 7 0.00125 3.98 0.00499
## 7 doc114 hedge 23 0.00380 1.31 0.00498
## 8 doc86 card 21 0.00504 0.989 0.00498
## 9 doc162 ancillary 5 0.00132 3.76 0.00497
## 10 doc201 assembly 19 0.00341 1.46 0.00497
## # ... with 76,728 more rows
mystopwords <- tibble(word =c("fiscal", "other", "related", "business", "rate", "product", "note", "payment", "table", "contract", "foreign", "currency", "such", "charge", "time", "purchase", "end","research","support", "policy", "available","party", "lease", "requirement", "recognition", "most","significant", "proceed","solutions","excess","issuance","connection","government","day","act","strategy","state","difference","march","country","experience","consulting","action","page","NA", "revenue", "income","tax","cash","cost","net","financial","net","asset","expense","result","total","share","service","company","increase","management","sale"))
allsw = stop_words %>% full_join(mystopwords)
## Joining, by = "word"
lemma = lem_tf_idf_2 %>% select(doc_id,lemma) %>% anti_join(allsw, by = c("lemma" = "word"))
# construct a sentence from lemmatized
lemma_2 <- lemma %>% group_by(doc_id) %>% summarise(text_tagged = paste(lemma,collapse = " "))
## `summarise()` ungrouping output (override with `.groups` argument)
# create a data frame for the coming analysis
text_df10 <- text_df8 %>% left_join(lemma_2)
## Joining, by = "doc_id"
# inspect NA values
na_rows = subset(text_df10, is.na(text_df10$text_tagged)) %>% select(comp_name,comp,filing_date,text_tagged)
# remove unused data
remove(postagged)
## Warning in remove(postagged): object 'postagged' not found
remove(lematized)
remove(lematized_2)
remove(lem_tf_idf)
remove(lem_tf_idf_2)
In this part, we generate four new variables, since the corpus and the subsequent analysis operate at the industry and year levels: the aggregated text column (tp), the average stock price difference (sp_avg), the average formality score (form_avg), and the average polarity (sent_avg).
toprocess = text_df10 %>%
select(industry, year,text_tagged,sp_diff,ave_sentiment,formality) %>%
group_by(industry,year) %>%
summarize(tp = paste(text_tagged, collapse = " "),
sp_avg = mean(sp_diff),
form_avg = mean(formality),
sent_avg = mean(ave_sentiment)) %>%
filter(!year == "2020") %>%
mutate(doc_id = paste("doc",row_number()),
year = as.numeric(year))
## `summarise()` regrouping output by 'industry' (override with `.groups` argument)
# prepare the corpus used for stm estimation
processed <- textProcessor(toprocess$tp,
metadata = toprocess,
customstopwords = allsw$word,
stem = F)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Remove Custom Stopwords...
## Removing numbers...
## Creating Output...
plotRemoved(processed$documents, lower.thresh=seq(1,200, by=100))
# keep only words that appear in more than 1% of the documents in the corpus
threshold1 <- round(1/100 * length(processed$documents),0)
result1 <- prepDocuments(processed$documents,
processed$vocab,
processed$meta,
lower.thresh = threshold1)
# keep only words that appear in more than 5% of the documents in the corpus
threshold5 <- round(5/100 * length(processed$documents),0)
result5 <- prepDocuments(processed$documents,
processed$vocab,
processed$meta,
lower.thresh = threshold5)
## Removing 2298 of 6501 terms (3036 of 72241 tokens) due to frequency
## Your corpus now has 50 documents, 4203 terms and 69205 tokens.
numtopics1 <- stm::searchK(result1$documents,result1$vocab,K=0, prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result1$meta,heldout.seed = 1) # K = 0 lets searchK suggest a number of topics; it returned 100
numtopics5 <- stm::searchK(result5$documents,result5$vocab,K=0, prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result5$meta,heldout.seed = 1) # suggested 67
numtopics1 <- stm::searchK(result1$documents,result1$vocab,K=c(10,25,40,55,70,85,100), prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result1$meta,heldout.seed = 1)
numtopics5 <- stm::searchK(result5$documents,result5$vocab,K=c(10,20,30,40,50,60,70), prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result5$meta,heldout.seed = 1)
plot(numtopics1) # k = 7-12
plot(numtopics5) # k = 7-12
numtopics1.1 <- stm::searchK(result1$documents,result1$vocab,K=seq(from=5, to=10, by=1), prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result1$meta,heldout.seed = 1)
numtopics5.1 <- stm::searchK(result5$documents,result5$vocab,K=seq(from=5, to=10, by=1), prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result5$meta,heldout.seed = 1)
#par(mar=c(1,1,1,1))
plot(numtopics1.1) # k = 8
knitr::kable(numtopics1.1$results)
| K | exclus | semcoh | heldout | residual | bound | lbound | em.its |
|---|---|---|---|---|---|---|---|
| 5 | 8.421397 | -7.830872 | -7.639919 | 26.84649 | -884444.4 | -884439.6 | 273 |
| 6 | 8.580645 | -6.902828 | -7.644201 | 1.307503 | -881438.6 | -881432 | 100 |
| 7 | 8.521526 | -4.640247 | -7.591978 | 3.018698 | -874803.8 | -874795.3 | 63 |
| 8 | 8.90089 | -5.456275 | -7.646804 | 2.106024 | -873097 | -873086.4 | 139 |
| 9 | 8.793623 | -6.06389 | -7.88957 | 2.6335 | -872954.5 | -872941.7 | 85 |
| 10 | 8.88515 | -5.893097 | -8.303825 | 3.166167 | -871926.5 | -871911.4 | 57 |
plot(numtopics5.1) # k = 5
knitr::kable(numtopics5.1$results)
| K | exclus | semcoh | heldout | residual | bound | lbound | em.its |
|---|---|---|---|---|---|---|---|
| 5 | 8.093306 | -4.3695 | -7.607057 | 0.9906994 | -852751 | -852746.2 | 191 |
| 6 | 8.339352 | -4.942364 | -7.602807 | 1.02831 | -851002.6 | -850996 | 104 |
| 7 | 9.086524 | -12.82152 | -7.545167 | 2.092363 | -845654.4 | -845645.9 | 113 |
| 8 | 8.949104 | -7.818806 | -7.682021 | 1.366209 | -844172.8 | -844162.2 | 171 |
| 9 | 9.010704 | -6.677006 | -8.021299 | 1.45672 | -841532.6 | -841519.8 | 183 |
| 10 | 8.888684 | -5.403103 | -8.24289 | 2.544388 | -839108.7 | -839093.6 | 154 |
# fitting stm model using the number of topics (K) = 8 with threshold = 1%
suppressWarnings(library(stm))
finfit1 <- stm::stm(documents = result1$documents,
vocab = result1$vocab,
K = 8,
prevalence = ~sp_avg + sent_avg + year + factor(industry),
#max.em.its = 75,
data = result1$meta,
reportevery=10,
gamma.prior = "L1",
sigma.prior = 0.9,
init.type = "Spectral",seed = 123)
suppressWarnings(library(stm))
finfit5 <- stm::stm(documents = result5$documents,
vocab = result5$vocab,
K = 5,
prevalence = ~sp_avg + sent_avg + year + factor(industry),
#max.em.its = 100,
data = result5$meta,
reportevery=10,
gamma.prior = "L1",
sigma.prior = 0.9,
init.type = "Spectral",seed = 123)
suppressWarnings(library(ggplot2))
suppressWarnings(library(plotly))
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:igraph':
##
## groups
## The following object is masked from 'package:ggmap':
##
## wind
## The following object is masked from 'package:qdap':
##
## %>%
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:sentimentr':
##
## highlight
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
exsem_m1 <- as.data.frame(cbind(c(1:8),exclusivity(finfit1), semanticCoherence(model=finfit1, result1$documents), "Model1"))
exsem_m5 <- as.data.frame(cbind(c(1:5),exclusivity(finfit5), semanticCoherence(model=finfit5, result5$documents), "Model5"))
exsem_comparison <- rbind(exsem_m1, exsem_m5)
colnames(exsem_comparison) <- c("K","Exclusivity", "SemanticCoherence", "Model")
exsem_comparison$Exclusivity<-as.numeric(as.character(exsem_comparison$Exclusivity))
exsem_comparison$SemanticCoherence<-as.numeric(as.character(exsem_comparison$SemanticCoherence))
options(repr.plot.width=7, repr.plot.height=7, repr.plot.res=100)
plotexsem <-ggplot(exsem_comparison, aes(SemanticCoherence, Exclusivity, color = Model))+geom_point(size = 2, alpha = 0.7) +
geom_text(aes(label=K), nudge_x=.05, nudge_y=.05)+
labs(x = "Semantic coherence",
y = "Exclusivity",
title = "Exclusivity VS. Semantic Coherence")
plotexsem
topic1_9 <- make.dt(finfit1, result1$meta)
View(topic1_9[,-c("tp")])
suppressWarnings(library(tidytext)) # prevent warning error
theta <- tidytext::tidy(finfit1, matrix = "theta")
theta_df1 <- theta[theta$document%in%c(1:17),] # select the first 17 documents.
thetaplot1 <- ggplot(theta_df1, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) +
geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ document, ncol = 3) +
labs(title = "Theta values per document",
y = expression(theta), x = "Topic")
# plot
thetaplot1
theta_df2 <- theta[theta$document%in%c(18:34),] # select the next 17 documents.
thetaplot2 <- ggplot(theta_df2, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) +
geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ document, ncol = 3) +
labs(title = "Theta values per document",
y = expression(theta), x = "Topic")
# plot
thetaplot2
theta_df3 <- theta[theta$document%in%c(35:51),] # select the next 17 documents.
thetaplot3 <- ggplot(theta_df3, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) +
geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ document, ncol = 3) +
labs(title = "Theta values per document",
y = expression(theta), x = "Topic")
# plot
thetaplot3
mod.corr <- topicCorr(finfit1)
plot(mod.corr)
topicQuality(finfit1,documents=result1$documents)
## [1] -1.235813 -7.276399 -4.163224 -5.590105 -4.152409 -9.224564 -1.967508
## [8] -1.369184
## [1] 8.957245 9.798218 8.528626 8.960868 9.091806 9.283643 7.875148 8.383260
# topic distribution and top 5 word association
plot.STM(finfit1, "summary", n=5)
labelTopics(finfit1, topics=c(5,8,7,1,2), n=10)
## Topic 5 Top Words:
## Highest Prob: information, partially, reportable, spending, estimate, measure, termination, environment, repurchase, cancellation
## FREX: information, achievement, staff, engagement, belief, billing, contractually, efficiency, effectiveness, team
## Lift: abacus, abr, adaptable, adomain, adversary, aeronautic, affair, afghanistan, agnostic, airborne
## Score: information, predict, skill, quota, bid, unbilled, absent, sciences, involuntary, booking
## Topic 8 Top Words:
## Highest Prob: technical, marketing, contents, customer, equivalents, stock, term, future, period, change
## FREX: complementary, training, ratably, upgrade, accuracy, vsoe, functionality, trademark, human, combination
## Lift: advertiser, ansy, archive, attract, basel, behance, beta, beverage, boston, browser
## Score: physically, unlimited, edition, detailead, budgetary, subscriber, publish, invoice, entitle, landlord
## Topic 7 Top Words:
## Highest Prob: operating, amount, growth, capital, fee, period, change, statement, institution, estimate
## FREX: fraud, loyalty, brazil, retail, migration, positively, regional, processing, house, institution
## Lift: absorb, alter, ancillary, apparel, brande, cbsm, checkwriter, custody, dishonored, drop
## Score: loyalty, delinquent, valuable, fraud, sponsorship, signature, pool, advisory, transactional, servicer
## Topic 1 Top Words:
## Highest Prob: repatriation, dividend, payable, exchange, debt, activity, adjustment, allowance, impairment, program
## FREX: analysis, taxation, potentially, repatriation, oem, rata, vigorously, unanticipate, eventual, explanation
## Lift: accompany, acs, actuary, adjacency, adjusted, advantaged, adviser, affiliates, airplane, airpod
## Score: analysis, oem, products, inventories, accompany, cents, connectkey, consumable, cooperation, dfe
## Topic 2 Top Words:
## Highest Prob: limited, ordinary, indefinitely, worldwide, respect, appeal, supplier, reference, determinable, headcount
## FREX: supplier, appeal, limited, ordinary, word, indefinitely, repair, worldwide, determinable, irs
## Lift: aftershock, alloy, amk, balanced, broadening, cessation, clearance, committe, discussion, disks
## Score: discussion, inventories, desktop, audio, inch, thailand, factory, handset, flash, intensive
plot.STM(finfit1, "labels", topics=c(5,8,7,1,2), label="frex", n=10, width=60)
beta <- tidytext::tidy(finfit1)
options(repr.plot.width=7, repr.plot.height=8, repr.plot.res=100)
beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(x = NULL, y = expression(beta),
title = "Highest word probabilities by topic")
thoughts5 <- findThoughts(finfit1, texts = toprocess$tp,
n = 2, topics = 5)$docs[[1]]
thoughts8 <- findThoughts(finfit1, texts = toprocess$tp,
n = 2, topics = 8)$docs[[1]]
thoughts7 <- findThoughts(finfit1, texts = toprocess$tp,
n = 2, topics = 7)$docs[[1]]
thoughts1 <- findThoughts(finfit1, texts = toprocess$tp,
n = 2, topics = 1)$docs[[1]]
thoughts2 <- findThoughts(finfit1, texts = toprocess$tp,
n = 2, topics = 2)$docs[[1]]
plotQuote(thoughts5, width = 100, maxwidth = 600,text.cex=1,main = "Topic5")
plotQuote(thoughts8, width = 100, maxwidth = 600,text.cex=1,main = "Topic 8")
plotQuote(thoughts7, width = 100, maxwidth = 600,text.cex=1,main = "Topic 7")
plotQuote(thoughts1, width = 100, maxwidth = 600,text.cex=1,main = "Topic 1")
plotQuote(thoughts2, width = 100, maxwidth = 600,text.cex=1,main = "Topic 2")
plot.STM(finfit1, type = "perspectives", topics=c(8,7), label="frex", n=10, width=100)
plot.STM(finfit1, type = "perspectives", topics=c(8,5), label="frex", n=10, width=60)
plot.STM(finfit1, type = "perspectives", topics=c(7,5), label="frex", n=16, width=60)
We see considerable overlap among the words of these topics. The distinctive word for topic 5 is "information", while for topic 8 it is "technical".
topic_labels <- c("T1",
"T2",
"T3",
"T4",
"T5",
"T6",
"T7",
"T8")
effect1 = stm::estimateEffect(~ sp_avg + sent_avg + year + factor(industry), stmobj = finfit1, metadata = result1$meta)
effect1_sum = summary(effect1)
# effect1_sum$tables
plot(effect1, covariate = "sp_avg",
topics = c(1:8),
model = finfit1, method = "difference",
cov.value1 = "+10", cov.value2 = "-10",
xlab = "Low Difference … High Difference",
xlim = c(-0.1,0.1),
main = "Marginal Effects of Price Difference",
#custom.labels =topic_labels,
labeltype = "custom")
plot(effect1, covariate = "sent_avg",
topics = c(1:8),
model = finfit1, method = "difference",
cov.value1 = "+5", cov.value2 = "-5",
xlab = "Low Polarity … High Polarity",
xlim = c(-0.3,0.3),
main = "Marginal Effects of Polarity",
#custom.labels =topic_labels,
labeltype = "custom")
for(i in 1:length(topic_labels)){
plot(effect1, covariate = "year",
topics = i,
model = finfit1, method = "continuous",
# plot the continuous effect of year on each topic's prevalence
xlab = "Year",
# xlim = c(0,800),
# ylim = c(-0.1,0.1),
main = topic_labels[i],
printlegend = FALSE,
custom.labels =topic_labels[i],
labeltype = "custom")
}
thetadf <- as.data.frame(finfit1$theta)
colnames(thetadf) <- topic_labels
reginfo <- cbind(result5$meta, thetadf) %>%
filter(!year == "2020")
reginfo$documents <- NULL
model1.information <- lm(sp_avg~sent_avg+year+factor(industry),data=reginfo)
# adding average formality
model2.information <- lm(sp_avg~sent_avg+form_avg+year+factor(industry),data=reginfo)
# adding topic T1
model3.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1,data=reginfo)
# adding topic T2
model4.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2,data=reginfo)
# adding topic T3
model5.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3,data=reginfo)
# adding topic T4
model6.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4,data=reginfo)
# adding topic T5
model7.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4+T5,data=reginfo)
# adding topic T6
model8.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4+T5+T6,data=reginfo)
# adding topic T7
model9.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4+T5+T6+T7,data=reginfo)
# adding topic T8
model10.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4+T5+T6+T7+T8,data=reginfo)
stargazer::stargazer(model1.information,model2.information,model3.information,model4.information,model5.information,model6.information,model7.information, model8.information,model9.information,model10.information,
single.row = TRUE, # to put coefficients and standard errors on same line
no.space = TRUE, # to remove the spaces after each line of coefficients
omit.stat=c("f", "ser"),
digits = 2,
column.sep.width = "0pt",
font.size = "tiny" ,# to make font size smaller
type = "text", out = "modelresult.html",
covariate.labels=c("avg. sentiment","avg. formality","year",
"DP","IT Consult","Semi Con","Tech Hardware","T1","T2","T3","T4","T5","T6","T7","T8","Intercept"))
# all packages used in this project
subset(data.frame(sessioninfo::package_info()), attached==TRUE, c(package, loadedversion))
## package loadedversion
## BatchGetSymbols BatchGetSymbols 2.6.1
## broom broom 0.7.2
## caret caret 6.0-86
## cld3 cld3 1.4.1
## cluster cluster 2.1.0
## corrr corrr 0.4.3
## dplyr dplyr 1.0.2
## edgar edgar 2.0.3
## edgarWebR edgarWebR 1.1.0
## forcats forcats 0.5.0
## Formula Formula 1.2-4
## geometry geometry 0.4.5
## ggmap ggmap 3.0.0
## ggplot2 ggplot2 3.3.2
## ggraph ggraph 2.0.5
## gridExtra gridExtra 2.3
## Hmisc Hmisc 4.5-0
## hunspell hunspell 3.0.1
## igraph igraph 1.2.6
## KernSmooth KernSmooth 2.23-17
## knitr knitr 1.30
## lattice lattice 0.20-41
## lda lda 1.4.2
## lmtest lmtest 0.9-38
## lubridate lubridate 1.7.9
## magrittr magrittr 1.5
## matrixStats matrixStats 0.57.0
## mctest mctest 1.3.1
## mgcv mgcv 1.8-33
## NCmisc NCmisc 1.1.6
## nlme nlme 3.1-149
## NLP NLP 0.2-1
## PerformanceAnalytics PerformanceAnalytics 2.0.4
## plm plm 2.4-1
## plotly plotly 4.9.2.1
## purrr purrr 0.3.4
## qdap qdap 2.4.3
## qdapDictionaries qdapDictionaries 1.0.7
## qdapRegex qdapRegex 0.7.2
## qdapTools qdapTools 1.3.5
## quantmod quantmod 0.4.17
## RColorBrewer RColorBrewer 1.1-2
## Rcpp Rcpp 1.0.5
## RcppArmadillo RcppArmadillo 0.10.1.0.0
## readr readr 1.4.0
## readtext readtext 0.80
## rsvd rsvd 1.0.5
## Rtsne Rtsne 0.15
## rvest rvest 0.3.6
## scales scales 1.1.1
## SentimentAnalysis SentimentAnalysis 1.3-4
## sentimentr sentimentr 2.7.1
## slam slam 0.1-48
## SnowballC SnowballC 0.7.0
## stm stm 1.3.6
## stringi stringi 1.5.3
## stringr stringr 1.4.0
## survival survival 3.2-7
## textcat textcat 1.0-7
## textshape textshape 1.7.1
## tibble tibble 3.0.4
## tidyquant tidyquant 1.0.3
## tidyr tidyr 1.1.2
## tidytext tidytext 0.3.1
## tidyverse tidyverse 1.3.0
## timeDate timeDate 3043.102
## tm tm 0.7-8
## topicmodels topicmodels 0.2-12
## TTR TTR 0.24.2
## udpipe udpipe 0.8.5
## widyr widyr 0.1.3
## wordcloud wordcloud 2.6
## xml2 xml2 1.3.2
## xts xts 0.12.1
## zoo zoo 1.8-8